{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparing model performance with a simple baseline\n",
    "\n",
    "In this notebook, we present how to compare the generalization performance of\n",
    "a model to a minimal baseline. In regression, we can use the `DummyRegressor`\n",
    "class to predict the mean target value observed on the training set without\n",
    "using the input features.\n",
    "\n",
    "We now demonstrate how to compute the score of a regression model and then\n",
    "compare it to such a baseline on the California housing dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "section named \"Appendix - Datasets description\" at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n",
    "target *= 100  # rescale the target in k$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Across all evaluations, we will use a `ShuffleSplit` cross-validation splitter\n",
    "with 20% of the data held on the validation side of the split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by running the cross-validation for a simple decision tree regressor.\n",
    "Here we compute the testing errors in terms of the mean absolute error (MAE)\n",
    "and then we store them in a pandas series to make it easier to plot the\n",
    "results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "regressor = DecisionTreeRegressor()\n",
    "cv_results_tree_regressor = cross_validate(\n",
    "    regressor, data, target, cv=cv, scoring=\"neg_mean_absolute_error\", n_jobs=2\n",
    ")\n",
    "\n",
    "errors_tree_regressor = pd.Series(\n",
    "    -cv_results_tree_regressor[\"test_score\"], name=\"Decision tree regressor\"\n",
    ")\n",
    "errors_tree_regressor.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we evaluate our baseline. This baseline is called a dummy regressor.\n",
    "This dummy regressor will always predict the mean target computed on the\n",
    "training target variable. Therefore, the dummy regressor does not use any\n",
    "information from the input features stored in the dataframe named `data`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyRegressor\n",
    "\n",
    "dummy = DummyRegressor(strategy=\"mean\")\n",
    "result_dummy = cross_validate(\n",
    "    dummy, data, target, cv=cv, scoring=\"neg_mean_absolute_error\", n_jobs=2\n",
    ")\n",
    "errors_dummy_regressor = pd.Series(\n",
    "    -result_dummy[\"test_score\"], name=\"Dummy regressor\"\n",
    ")\n",
    "errors_dummy_regressor.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now plot the cross-validation testing errors for the mean target baseline\n",
    "and the actual decision tree regressor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_errors = pd.concat(\n",
    "    [errors_tree_regressor, errors_dummy_regressor],\n",
    "    axis=1,\n",
    ")\n",
    "all_errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "bins = np.linspace(start=0, stop=100, num=80)\n",
    "all_errors.plot.hist(bins=bins, edgecolor=\"black\")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "plt.xlabel(\"Mean absolute error (k$)\")\n",
    "_ = plt.title(\"Cross-validation testing errors\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the generalization performance of our decision tree is far from\n",
    "being perfect: the price predictions are off by more than 45,000 US dollars on\n",
    "average. However it is much better than the mean price baseline. So this\n",
    "confirms that it is possible to predict the housing price much better by using\n",
    "a model that takes into account the values of the input features (housing\n",
    "location, size, neighborhood income...). Such a model makes more informed\n",
    "predictions and approximately divides the error rate by a factor of 2 compared\n",
    "to the baseline that ignores the input features.\n",
    "\n",
    "Note that here we used the mean price as the baseline prediction. We could\n",
    "have used the median instead. See the online documentation of the\n",
    "[sklearn.dummy.DummyRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html)\n",
    "class for other options. For this particular example, using the mean instead\n",
    "of the median does not make much of a difference but this could have been the\n",
    "case for dataset with extreme outliers."
   ]
  },
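  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged aside on the choice between these two strategies (the constant\n",
    "$c$ below is notation introduced here, not part of the scikit-learn API): the\n",
    "mean is the constant prediction that minimizes the squared error on the\n",
    "training targets, while a median is a constant prediction that minimizes the\n",
    "absolute error on the training targets:\n",
    "\n",
    "$\n",
    "\\bar{y} = \\arg\\min_c \\sum_i (y_i - c)^2,\n",
    "\\qquad\n",
    "\\operatorname{median}(y) \\in \\arg\\min_c \\sum_i |y_i - c|.\n",
    "$\n",
    "\n",
    "Under this reading, when the metric is MAE, `strategy=\"median\"` gives the\n",
    "constant baseline with the lowest training error. This guarantee only holds\n",
    "on the training set, so the two baselines can still rank differently on\n",
    "held-out data."
   ]
  },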
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let us see what happens if we measure the test score using $R^2$\n",
    "instead of the mean absolute error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_dummy = cross_validate(\n",
    "    dummy, data, target, cv=cv, scoring=\"r2\", return_train_score=True, n_jobs=2\n",
    ")\n",
    "r2_train_score_dummy_regressor = pd.Series(\n",
    "    result_dummy[\"train_score\"], name=\"Dummy regressor train score\"\n",
    ")\n",
    "r2_train_score_dummy_regressor.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The $R^2$ score is always 0. It can be shown that this is always the case,\n",
    "because of its mathematical definition. If you are interested in the proof,\n",
    "unfold the dropdown below.\n",
    "\n",
    "```{admonition} Mathematical explanation\n",
    ":class: dropdown\n",
    "Recall that the $R^2$ score is defined as:\n",
    "\n",
    "$\n",
    "R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2}\n",
    "$\n",
    "\n",
    "But our model always predicts the mean, i.e. for all $i$,\n",
    "$\\hat{y}_i = \\bar{y}$, so:\n",
    "\n",
    "$\n",
    "R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2}\n",
    "    = 1 - \\frac{\\sum_i (y_i - \\bar{y}  )^2}{\\sum_i (y_i - \\bar{y})^2}\n",
    "    = 1 - 1\n",
    "    = 0.\n",
    "$\n",
    "```\n",
    "\n",
    "This helps put your model's $R^2$ score in perspective: if your model has an\n",
    "$R^2$ score higher than 0 then it performs better than a `DummyRegressor` with\n",
    "`strategy=\"mean\"`; similarly, if the $R^2$ score is lower than 0 then your\n",
    "model is worse than the dummy regressor. For the test score, we observe\n",
    "something similar, but with an additional effect coming from the dataset\n",
    "variations: the mean target value measured on the testing set is slightly\n",
    "different from the mean target value measured on the training set."
   ]
  },
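  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make this effect explicit, here is a short derivation sketch. The symbols\n",
    "$\\bar{y}_{\\text{train}}$, $\\bar{y}_{\\text{test}}$ and $n$ (the number of\n",
    "testing samples) are notation introduced for this sketch, and the sums run\n",
    "over the testing samples. On a testing set, the denominator of $R^2$ uses the\n",
    "testing mean, while the dummy regressor predicts the training mean:\n",
    "\n",
    "$\n",
    "R^2_{\\text{test}}\n",
    "    = 1 - \\frac{\\sum_i (y_i - \\bar{y}_{\\text{train}})^2}{\\sum_i (y_i - \\bar{y}_{\\text{test}})^2}\n",
    "    = 1 - \\frac{\\sum_i (y_i - \\bar{y}_{\\text{test}})^2 + n (\\bar{y}_{\\text{test}} - \\bar{y}_{\\text{train}})^2}{\\sum_i (y_i - \\bar{y}_{\\text{test}})^2}\n",
    "    = - \\frac{n (\\bar{y}_{\\text{test}} - \\bar{y}_{\\text{train}})^2}{\\sum_i (y_i - \\bar{y}_{\\text{test}})^2}\n",
    "    \\leq 0.\n",
    "$\n",
    "\n",
    "The second equality holds because the cross term vanishes:\n",
    "$\\sum_i (y_i - \\bar{y}_{\\text{test}}) = 0$. Under these assumptions, the test\n",
    "score of the mean baseline can never be positive, and it stays close to 0\n",
    "whenever the training and testing means are close to each other."
   ]
  },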
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r2_test_score_dummy_regressor = pd.Series(\n",
    "    result_dummy[\"test_score\"], name=\"Dummy regressor test score\"\n",
    ")\n",
    "r2_test_score_dummy_regressor.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In conclusion, $R^2$ is a normalized metric, which makes it independent of the\n",
    "physical unit of the target variable, unlike MAE. A $R^2$ score of 0.0 is the\n",
    "performance of a model that always predicts the mean observed value of the\n",
    "target, while 1.0 corresponds to a model that predicts exactly the observed\n",
    "target variable for each given input observation. Notice that it is only\n",
    "possible to reach 1.0 if the target variable is a deterministic function of\n",
    "the available input features. In practice, external factors often introduce\n",
    "variability in the target that cannot be explained by the available features.\n",
    "Therefore, the $R^2$ score of an optimal model is typically less than 1.0, not\n",
    "due to a limitation of the machine learning algorithm itself, but because\n",
    "the chosen input features are fundamentally not informative enough to\n",
    "deterministically predict the target.\n",
    "\n",
    "Overall, $R^2$ represents the proportion of the target's variability explained\n",
    "by the model, while MAE, which retains the physical units of the target, can\n",
    "be helpful for reporting errors in those units."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}